df = read.csv("E:\\Linder_college\\Linear Regression\\dataset\\alumni.csv")
Summary Statistics for Percent of classes under 20:
summary(df$percent_of_classes_under_20)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 44.75 59.50 55.73 66.25 77.00
Summary Statistics for Alumni giving rate:
summary(df$alumni_giving_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 18.75 29.00 29.27 38.50 67.00
plot(df$percent_of_classes_under_20, df$alumni_giving_rate, pch=20, xlab = "Percent of Classes Under 20",ylab = "Alumni Giving Rate", main = "Percent of Classes Under 20 VS Alumni Giving Rate")
cat("Correlation coefficient is:\n")
## Correlation coefficient is:
cor(df$percent_of_classes_under_20,df$alumni_giving_rate)
## [1] 0.6456504
hist(df$percent_of_classes_under_20, xlab = "Percent of Classes Under 20",main = "Histogram of Percent of Classes Under 20 ")
hist(df$alumni_giving_rate, xlab = "Alumni giving Rate", main = "Histogram of Alumni Giving Rate")
The data for both the predictor and the response variable appear to be continuous.
boxplot(df$percent_of_classes_under_20,main = "Box Plot of Percent of Classes Under 20 ", ylab = "Percent of Classes Under 20")
boxplot(df$alumni_giving_rate, main = "Box Plot of Alumni Giving Rate", ylab = "Alumni Giving Rate")
We can infer from the box plots above that there are no outliers.
We observe that the data for both the predictor and the response variable are continuous. The scatter plot shows an upward trend, so there is a positive correlation between the two variables.
The data contain no outliers for either the predictor or the response variable.
However, the points are fairly scattered, which is reflected in the moderate correlation coefficient of 0.646. Points with the predictor (percent of classes under 20) between 60 and 70 percent appear to cluster more tightly around the trend than those between 30 and 60 percent.
plot(df$percent_of_classes_under_20, df$alumni_giving_rate, pch=20, xlab = "Percent of Classes Under 20",ylab = "Alumni Giving Rate", main = " Linear Regression Plot: Percent of Classes Under 20 VS Alumni Giving Rate")
abline(lm(df$alumni_giving_rate ~ df$percent_of_classes_under_20),lwd=1.5)
model1 = lm(formula = df$alumni_giving_rate ~ df$percent_of_classes_under_20)
model1
##
## Call:
## lm(formula = df$alumni_giving_rate ~ df$percent_of_classes_under_20)
##
## Coefficients:
## (Intercept) df$percent_of_classes_under_20
## -7.3861 0.6578
The estimated regression equation is Y = -7.3861 + 0.6578X
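The estimated equation can be used directly for prediction. As a sketch (the value X = 60 is just an illustrative input, not from the dataset):

```r
# Hand-compute a prediction from the estimated equation Y = -7.3861 + 0.6578 X.
# For a hypothetical school with 60% of classes under 20 students:
b0 <- -7.3861
b1 <- 0.6578
y_hat <- b0 + b1 * 60
y_hat  # about 32.08: a predicted alumni giving rate of roughly 32%
```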
summary(model1)
##
## Call:
## lm(formula = df$alumni_giving_rate ~ df$percent_of_classes_under_20)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.053 -7.158 -1.660 6.734 29.658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.3861 6.5655 -1.125 0.266
## df$percent_of_classes_under_20 0.6578 0.1147 5.734 7.23e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.38 on 46 degrees of freedom
## Multiple R-squared: 0.4169, Adjusted R-squared: 0.4042
## F-statistic: 32.88 on 1 and 46 DF, p-value: 7.228e-07
We see that for each one-percentage-point increase in percent_of_classes_under_20, the predicted alumni_giving_rate increases by 0.6578. The p-value (7.23e-07) is less than 0.05, so percent_of_classes_under_20 is a statistically significant predictor of the alumni giving rate.
The residual standard error of 10.38 is moderate relative to the range of the response (7 to 67), so the model provides a reasonable, though not tight, fit for predicting alumni giving rates.
The multiple R-squared value of 0.4169 indicates that approximately 41.69% of the variance in the response variable, alumni_giving_rate, is accounted for by the predictor percent_of_classes_under_20.
Additionally, the F-statistic of 32.88, with a p-value less than 0.05, indicates that the model as a whole is statistically significant in explaining the variation observed in alumni_giving_rate.
The adjusted R-squared of 0.40 tells the same story after adjusting for the number of predictors: around 40% of the variability in alumni_giving_rate is explained by the model.
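As a consistency check on the summary output above: in simple linear regression the F-statistic equals the square of the slope's t-statistic, which we can verify from the reported estimate and standard error alone:

```r
# In simple linear regression, F = t^2 for the slope coefficient.
t_slope <- 0.6578 / 0.1147   # estimate / std. error from the summary above
t_slope      # ~5.73, matching the reported t value of 5.734
t_slope^2    # ~32.89, matching the reported F-statistic of 32.88
```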
set.seed(7052)
x = rnorm(n = 100, mean = 2, sd = 0.1)
error_data = rnorm(n = 100, mean = 0, sd = 0.5)
y = 10 + 5*x + error_data
cat("Summary Statistics for X\n")
## Summary Statistics for X
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.725 1.923 2.001 2.004 2.070 2.243
cat("\nSummary Statistics for Y \n")
##
## Summary Statistics for Y
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.09 19.67 20.11 20.17 20.70 21.80
hist(x, xlab = "Predictor Variable X", main = "Histogram of Predictor Variable X")
hist(y, ylab = "Response Variable Y", main = "Histogram of Response Variable Y")
cat("Correlation coefficient for data \n")
## Correlation coefficient for data
cor(x,y)
## [1] 0.8042198
Scatter Plot
plot(x,y, pch=20, xlab = "Predictor Variable X", ylab = "Response Variable Y", main = "Predictor Variable X VS Response Variable Y" )
boxplot(x,xlab = "Predictor Variable X", main = "Box Plot of Predictor Variable X" )
boxplot(y, ylab = "Response Variable Y", main = "Box Plot of Response Variable Y")
As the correlation coefficient (0.804) is fairly close to 1, there is a strong positive linear relationship between X and Y; the box plots also show no outliers in the data for either X or Y.
model = lm(formula = y ~ x)
model_summ = summary(model)
model_summ
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2073 -0.3029 0.0093 0.3033 1.3545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.0218 0.8336 10.82 <2e-16 ***
## x 5.5652 0.4155 13.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4509 on 98 degrees of freedom
## Multiple R-squared: 0.6468, Adjusted R-squared: 0.6432
## F-statistic: 179.4 on 1 and 98 DF, p-value: < 2.2e-16
The estimated regression equation is Y = 9.0218 + 5.5652X
The estimated coefficients are : intercept: 9.0218 , X: 5.5652
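Since the data were simulated from Y = 10 + 5X + error, we can ask how far the estimates fall from the true parameters, measured in standard-error units (using the estimates and standard errors from the summary above):

```r
# True data-generating model: Y = 10 + 5X + e.
# Distance of each estimate from the truth, in standard-error units:
(9.0218 - 10) / 0.8336   # intercept: about -1.17 SEs from the true value 10
(5.5652 - 5) / 0.4155    # slope: about 1.36 SEs from the true value 5
# Both estimates lie within two standard errors of the truth,
# consistent with ordinary sampling variability.
```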
mean(model_summ$residuals^2)
## [1] 0.1992276
The model mean squared error (MSE) is 0.1992
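This MSE is consistent with the residual standard error reported earlier: the RSE squares the sum of squared residuals over n - 2 degrees of freedom, while mean(residuals^2) divides by n, so the two are related by a factor of (n - 2)/n:

```r
# Relation between the residual standard error (RSE) and the MSE above:
# RSE^2 = SSE / (n - 2), while mean(residuals^2) = SSE / n.
rse <- 0.4509            # from the summary output (98 df, n = 100)
mse <- rse^2 * 98 / 100  # convert RSE^2 back to SSE/n
mse  # ~0.1992, matching mean(model_summ$residuals^2) up to rounding
```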
x_mean = mean(x)
y_mean = mean(y)
cat("Sample mean of X",x_mean)
## Sample mean of X 2.003677
cat("\n")
cat("Sample mean of Y",y_mean)
## Sample mean of Y 20.17258
plot(x, y, pch=20, xlab = "Predictor Variable X", ylab = "Response Variable Y", main = "Linear Regression Plot: X VS Y")
abline(lm(y~x),lwd=1.5)
points(x_mean, y_mean,col = 'red',pch=20)
Here we observe that the point (X¯,Y¯) lies on the regression line. This is not, by itself, evidence of a good fit: the least-squares line always passes through the point of sample means by construction, since the fitted coefficients satisfy Y¯ = b0 + b1X¯. What it does confirm is that the fitted line passes through the central tendency of the data.